Explore the structure and the dimensions of the data and describe the dataset briefly. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506 14
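The structure and dimensions printed above match what `str()` and `dim()` return for a data frame with 506 rows and 14 columns. A minimal sketch of how such an overview could be produced, assuming the data is the `Boston` data frame shipped with the MASS package (the name `cor_matrix` is only illustrative):

```r
# Assumption: the data is the Boston data frame from the MASS package.
library(MASS)
data("Boston")

str(Boston)      # variable names, types and the first few values
dim(Boston)      # number of rows and columns
summary(Boston)  # quartiles and mean of every variable

# Rounded pairwise correlations give a compact numeric view of the
# relationships between the variables.
cor_matrix <- round(cor(Boston), 2)
cor_matrix["crim", ]  # correlations of the crime rate with the other variables
```

A scatterplot matrix via `pairs(Boston)` is one option for the graphical overview.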
After standardization, every variable has a mean of zero and a standard deviation of one, so the values are distributed around zero.
crim zn indus
Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
Median :-0.390280 Median :-0.48724 Median :-0.2109
Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
chas nox rm age
Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
dis rad tax ptratio
Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
black lstat medv
Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
Median : 0.3808 Median :-0.1811 Median :-0.1449
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
[1] "matrix"
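Standardization replaces each value x by (x - mean(x)) / sd(x), column by column. The following is a sketch of how the summary above could be reproduced, assuming the `Boston` data from MASS and base R's `scale()`; the object name `boston_scaled` is an assumption.

```r
library(MASS)
data("Boston")

# scale() subtracts the column mean and divides by the column standard deviation
boston_scaled <- scale(Boston)
class(boston_scaled)  # a matrix (recent R versions also report "array")

# Convert back to a data frame so that columns can be accessed with $
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)

# Check: column means are numerically zero and standard deviations are one
round(colMeans(boston_scaled), 10)
sapply(boston_scaled, sd)
```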
crime
low med_low med_high high
127 126 126 127
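The four nearly equal group sizes are what one gets when a continuous variable is cut at its quartiles. A hedged sketch of that step, assuming the scaled crime rate `crim` was binned with `quantile()` and `cut()` (the object names are illustrative):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# Quartiles of the scaled crime rate serve as bin edges
bins <- quantile(boston_scaled$crim)

# include.lowest = TRUE keeps the minimum value inside the first bin
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the numeric crime rate with its categorical version
boston_scaled$crim <- NULL
boston_scaled$crime <- crime
```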
Call:
lda(crime ~ ., data = train)
Prior probabilities of groups:
low med_low med_high high
0.2623762 0.2376238 0.2549505 0.2450495
Group means:
zn indus chas nox rm
low 0.98130174 -0.9066431 -0.123759247 -0.8840549 0.4727977
med_low -0.09732623 -0.2797183 0.014751158 -0.5526505 -0.1612113
med_high -0.37817407 0.2234619 0.224586496 0.4283399 0.1858054
high -0.48724019 1.0171737 0.006051757 1.0425648 -0.4943251
age dis rad tax ptratio
low -0.8799711 0.8776391 -0.6925875 -0.7290337 -0.48886393
med_low -0.2726170 0.3687760 -0.5583740 -0.4744113 -0.04393502
med_high 0.4031911 -0.4143320 -0.3853374 -0.2851164 -0.36468085
high 0.8013977 -0.8371692 1.6375616 1.5136504 0.78011702
black lstat medv
low 0.3808634 -0.78888000 0.56746998
med_low 0.3145534 -0.08442091 -0.03086276
med_high 0.0692395 -0.07036499 0.24862295
high -0.7205699 0.90561295 -0.66275564
Coefficients of linear discriminants:
LD1 LD2 LD3
zn 0.090717939 0.641865637 -0.8647445
indus -0.009184624 -0.249613915 0.1627299
chas -0.077574894 -0.031438525 0.1405313
nox 0.384221239 -0.732333796 -1.2477158
rm -0.101302748 -0.096585547 -0.1739006
age 0.281926343 -0.306344858 0.0215151
dis -0.099435704 -0.148680544 0.2346623
rad 2.987569110 0.849616989 -0.2434196
tax 0.011829136 0.073820136 0.5454135
ptratio 0.110676590 0.045970115 -0.1260826
black -0.118508965 0.006123112 0.1066025
lstat 0.201672706 -0.078432106 0.5001612
medv 0.181386179 -0.275042300 -0.1260297
Proportion of trace:
LD1 LD2 LD3
0.9424 0.0422 0.0154
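The proportion of trace shows that LD1 accounts for about 94% of the between-group variance, and `rad` has by far the largest LD1 coefficient. A sketch of how a model of this form could be fitted follows. The 80/20 split and the seed are assumptions, so the numbers will not match the output above exactly.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)  # hypothetical seed; the original split is unknown
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(0.8 * n))  # 80% of the rows for training
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

# Linear discriminant analysis with crime as target and all others as predictors
lda.fit <- lda(crime ~ ., data = train)
lda.fit

# Proportion of trace: share of between-group variance per discriminant
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 4)
```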
predicted
correct low med_low med_high high
low 11 10 0 0
med_low 10 17 3 0
med_high 0 10 13 0
high 0 0 0 28
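To interpret the cross-tabulation, it helps to compute the overall accuracy and the share of correctly classified observations per class. The sketch below simply re-enters the counts printed above; the object names are illustrative.

```r
classes <- c("low", "med_low", "med_high", "high")

# Counts copied from the cross-tabulation above (rows: correct, columns: predicted)
conf <- matrix(c(11, 10,  0,  0,
                 10, 17,  3,  0,
                  0, 10, 13,  0,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes, predicted = classes))

# Overall accuracy: correctly classified observations divided by all observations
accuracy <- sum(diag(conf)) / sum(conf)

# Per-class share of correct predictions (row-wise)
per_class <- diag(conf) / rowSums(conf)

round(accuracy, 3)
round(per_class, 3)
```

With these counts, 69 of the 102 test observations are classified correctly. The high class is predicted without error, while the three lower classes are mostly confused with their neighbours.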
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.119 85.624 170.539 226.315 371.950 626.047
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.016 149.145 279.505 342.899 509.707 1198.265
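The two summaries above have the shape of pairwise distance summaries, presumably Euclidean first and Manhattan second. Their magnitude, with maxima above 600, suggests they were computed on the unscaled data, since standardized variables only span a few units each. A sketch under these assumptions that computes both versions for comparison:

```r
library(MASS)
data("Boston")

# Euclidean distances on the raw data: dominated by large-scale variables such as tax
dist_eu_raw <- dist(Boston)
summary(dist_eu_raw)

# Distances on the standardized data, where every variable contributes comparably
boston_scaled <- scale(Boston)
dist_eu <- dist(boston_scaled)                         # Euclidean (default)
dist_man <- dist(boston_scaled, method = "manhattan")  # Manhattan
summary(dist_eu)
summary(dist_man)
```

The Manhattan distance between two points is never smaller than their Euclidean distance, which is why the second summary has larger values throughout.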
Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters?
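One possible approach to the bonus task is sketched below. The choice of three clusters, the seed, and all object names are assumptions. `lda()` stops when a predictor is constant within every group, so the sketch filters on within-cluster spread first; normally all 14 variables pass.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # hypothetical seed for reproducible cluster assignments
km <- kmeans(boston_scaled, centers = 3, nstart = 20)  # 3 clusters is an assumption
table(km$cluster)

# Within-cluster standard deviation of each variable (pooled over clusters);
# variables below lda()'s default tolerance would make the fit fail
within_sd <- sapply(boston_scaled, function(v) sd(v - ave(v, km$cluster)))
keep <- names(within_sd)[within_sd > 1e-4]

# LDA with the k-means clusters as target classes
lda_km <- lda(x = boston_scaled[, keep], grouping = factor(km$cluster))

# Variables ranked by the absolute size of their LD1 coefficient
sort(abs(lda_km$scaling[, "LD1"]), decreasing = TRUE)
```

The coefficients in `lda_km$scaling` are also what the biplot arrows are drawn from, so the variables at the top of this ranking correspond to the longest arrows along LD1.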
Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
Adjust the code: add color as an argument to the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
[1] 404 13
[1] 13 3
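The two dimensions above fit together as a matrix product: a 404 x 13 predictor matrix multiplied by the 13 x 3 matrix of discriminant coefficients gives a 404 x 3 matrix of projected points. The sketch below rebuilds this under assumptions: the seed, the split, and the four k-means centers are hypothetical. The plots are only drawn if the plotly package is installed.

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

set.seed(123)  # hypothetical seed; the original split is unknown
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Predictors only (drop the target), then project onto the discriminants
model_predictors <- train[, names(train) != "crime"]
dim(model_predictors)
dim(lda.fit$scaling)
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)

# k-means on the same predictors; 4 centers is an assumption
km_train <- kmeans(model_predictors, centers = 4, nstart = 20)

if (requireNamespace("plotly", quietly = TRUE)) {
  # Same projection, coloured once by crime class and once by cluster
  p_crime <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                             z = matrix_product$LD3, type = "scatter3d",
                             mode = "markers", color = train$crime)
  p_kmeans <- plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
                              z = matrix_product$LD3, type = "scatter3d",
                              mode = "markers", color = factor(km_train$cluster))
}
```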